🤖 LLM Inference
Model Serving, Quantization, vLLM, ONNX Runtime
Scoured 9,352 posts in 13.1 ms

Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference
🧠 LLM · arxiv.org · 5d

Introducing dotLLM - Building an LLM Inference Engine in C#
🧠 LLM · kokosa.dev · 12h · Hacker News

amitshekhariitbhu/llm-internals: Learn LLM internals step by step - from tokenization to attention to inference optimization.
🧠 LLM · github.com · 1d · Hacker News

I-DLM: Introspective Diffusion Language Models
🧠 LLM · introspective-diffusion.github.io · 20h · Hacker News, r/LocalLLaMA

Inside LLM Inference: KV Cache, Prefill, and the Decode Bottleneck
🧠 LLM · pub.towardsai.net · 5d

Stop benchmarking inference providers, a guide to easy evaluation
🤖 Large Language Models · huggingface.co · 13h · r/LocalLLaMA

Model API Performance
🤖 Large Language Models · news.ycombinator.com · 18h · Hacker News

Quantization, LoRA, and the 8% Problem: Benchmarking Local LLMs for Production AI
💬 LLMs · walsenburgtech.com · 3d · Hacker News

LLM inference, optimized for your Mac
✍️ Prompt Engineering · omlx.ai · 4d · Hacker News

LLM inference engine written ground-up natively in C#/.NET
🧠 LLM · dotllm.dev · 11h · Hacker News

Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference
💬 LLMs · arxiv.org · 1d

patilyashvardhan2002-byte/lazy-moe: The GPU-free LLM inference engine. Combines lazy expert loading + TurboQuant KV compression to run models that shouldn't fit on your hardware. Built from scratch, fully local, zero cloud.
💾 Bytecode · github.com · 2d · r/LocalLLaMA

Four Reasons Why FPGAs Hit the Sweet Spot for LLM Inference
🏗️ RISC-V · pub.towardsai.net · 13h

Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures
💬 LLMs · arxiv.org · 2d

A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs
🔬 eBPF · arxiv.org · 1d

StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
📨 Event-Driven Architecture · arxiv.org · 1d

Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC
🧠 LLM · arxiv.org · 5d

QCFuse: Query-Centric Cache Fusion for Efficient RAG Inference
🎯 Retrieval Systems · arxiv.org · 2d

MEMENTO: Teaching LLMs to Manage Their Own Context
💬 LLMs · arxiv.org · 1d

Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale
🔬 eBPF · arxiv.org · 6d